My AI Learning Journey: Day 14, Gemma and Gemini - Gemma 3n Audio

2025 iThome Ironman Contest

DAY 14

Generative AI

My AI Learning Journey: 30 Days of Gemma and Gemini, part 14 of the series

17th Ironman Contest

kevin_chiu

Team: AI 航海王

2025-09-15 23:38:08

Gemma 3n

Gemma 3n is a recent multimodal open model from Google and a member of the Gemma model family. It is released in two sizes, E2B and E4B. Compared with earlier Gemma versions, its main highlight is its multimodal capability: besides text, it can also understand images and audio.

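Key features of Gemma 3n
Multimodal processing: This is the most important advance in Gemma 3n. It accepts text, image, and audio inputs together and generates a response from the combined information. This lets it handle more complex tasks, for example:

 - Visual question answering (VQA)

 - Video analysis

 - Speech-to-text

 - Multimodal conversation: discussing text and images within the same conversation.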

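Open model: Gemma 3n continues the open approach of the Gemma family. Developers and researchers can download the weights at no cost and use them for research and commercial applications, subject to the Gemma Terms of Use.

High efficiency: Despite its small size, Gemma 3n has strong reasoning and generation capabilities. The E2B and E4B names indicate its scale in "effective" parameters (roughly 2B and 4B), and the model is suited to running on modest hardware such as a single GPU.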
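
The multimodal conversation described above comes down to one chat turn whose content list mixes several input types. Here is a minimal sketch of a helper that builds such a turn, using the same `type` / `text` / `audio` / `image` keys as the sample code later in this post. The name `build_message` is hypothetical and not part of any library.

```python
def build_message(role, text=None, audio=None, image=None):
    """Build one chat turn as a dict with a list of typed content parts."""
    content = []
    if image is not None:
        content.append({"type": "image", "image": image})
    if audio is not None:
        content.append({"type": "audio", "audio": audio})
    if text is not None:
        content.append({"type": "text", "text": text})
    if not content:
        raise ValueError("a message needs at least one of text, audio, or image")
    return {"role": role, "content": content}


# Example: an audio clip plus a text instruction in a single user turn.
# Only the dict is built here; the file is not opened.
message = build_message(
    "user",
    text="Transcribe this audio.",
    audio="/content/m_k.wav",
)
print(message)
```

The resulting dict has the shape of the `prompt` objects passed to `send_message` below, so it could be used in their place.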

Gemma 3n speech-to-text

Sample code

# Install a transformers version that supports Gemma 3n (>= 4.53)
!pip install "transformers>=4.53.0" "timm>=1.0.16"

# Define formatting helper functions
import torch

GEMMA_PATH = "google/gemma-3n-E2B-it" #@param ["google/gemma-3n-E2B-it", "google/gemma-3n-E4B-it"]
RESOURCE_URL_PREFIX = "https://raw.githubusercontent.com/google-gemini/gemma-cookbook/refs/heads/main/Demos/sample-data/"

from IPython.display import Audio, Image, Markdown, display

class ChatState():
  def __init__(self, model, processor):
    self.model = model
    self.processor = processor
    self.history = []

  def send_message(self, message, max_tokens=256):
    self.history.append(message)

    input_ids = self.processor.apply_chat_template(
        self.history,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    input_len = input_ids["input_ids"].shape[-1]

    # Move inputs to the model's device and dtype (use self.model, not the global `model`)
    input_ids = input_ids.to(self.model.device, dtype=self.model.dtype)
    outputs = self.model.generate(
        **input_ids,
        max_new_tokens=max_tokens,
        disable_compile=True
    )
    text = self.processor.batch_decode(
        outputs[:, input_len:],
        skip_special_tokens=True,
        clean_up_tokenization_spaces=True
    )
    self.history.append({
        "role": "assistant",
        "content": [
            {"type": "text", "text": text[0]},
        ]
    })

    # display chat
    for item in message['content']:
      if item['type'] == 'text':
        formatted_prompt = "<font size='+1' color='brown'>🙋‍♂️<blockquote>\n" + item['text'] + "\n</blockquote></font>"
        display(Markdown(formatted_prompt))
      elif item['type'] == 'audio':
        display(Audio(item['audio']))
      elif item['type'] == 'image':
        display(Image(item['image']))

    formatted_text = "<font size='+1' color='teal'>🤖<blockquote>\n" + text[0] + "\n</blockquote></font>"
    display(Markdown(formatted_text))
    
# Load Model    
from transformers import AutoModelForImageTextToText, AutoProcessor

processor = AutoProcessor.from_pretrained(GEMMA_PATH)
model = AutoModelForImageTextToText.from_pretrained(GEMMA_PATH, torch_dtype="auto", device_map="auto")

print(f"Device: {model.device}")
print(f"DType: {model.dtype}")
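
Before sending a recording to the model, it can help to look at the file's basic properties. The sketch below reads a WAV header with Python's standard `wave` module and lists anything that differs from a 16 kHz, mono, 30-second-or-shorter clip. Those target values are assumptions about what the audio pipeline prefers, not something this post verifies. The helper names `describe_wav` and `check_for_asr` are hypothetical.

```python
import wave


def describe_wav(path):
    """Read basic properties of a PCM WAV file from its header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_seconds": frames / rate,
        }


def check_for_asr(info, expected_rate=16000, max_seconds=30.0):
    """Return notes on how the file differs from the assumed targets.

    expected_rate and max_seconds are assumptions, not verified limits.
    """
    notes = []
    if info["channels"] != 1:
        notes.append(f"{info['channels']} channels (mono assumed)")
    if info["sample_rate"] != expected_rate:
        notes.append(f"sample rate {info['sample_rate']} Hz (assumed {expected_rate} Hz)")
    if info["duration_seconds"] > max_seconds:
        notes.append(f"duration {info['duration_seconds']:.1f} s (assumed at most {max_seconds} s)")
    return notes
```

For the file used in this post, the call would be `check_for_asr(describe_wav("/content/m_k.wav"))`. An empty list means the file matches the assumed targets.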

ASR (automatic speech recognition)

prompt = {
  "role": "user",
  "content": [
    {"type": "audio", "audio": "/content/m_k.wav"}
  ]
}
chat = ChatState(model, processor)
chat.send_message(prompt)

Output

你好，我想看一下最近有什么手机装备。 你好，最近热门的是iPhone 17和Galaxy S26。 你有品牌偏好吗？ 我原本要iPhone 12用得还不错，不过有点卡了。 那iPhone 17是不错的选择，它的金片设计很多，拍照也很清晰，更精致，搭配我们的
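
The transcript above stops mid-sentence. One possible cause is the `max_tokens=256` cap in `send_message`, which can be raised with `chat.send_message(prompt, max_tokens=512)`. Another is clip length, on the assumption that the model works best with clips of roughly 30 seconds or less. For the second case, here is a minimal sketch that splits a PCM WAV file into shorter chunks with the standard `wave` module. The `split_wav` helper and the 30-second default are illustrative assumptions.

```python
import os
import wave


def split_wav(path, out_dir, chunk_seconds=30):
    """Split a PCM WAV file into consecutive chunks and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with wave.open(path, "rb") as src:
        channels = src.getnchannels()
        sample_width = src.getsampwidth()
        rate = src.getframerate()
        frames_per_chunk = int(rate * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the source file
            chunk_path = os.path.join(out_dir, f"chunk_{index:03d}.wav")
            with wave.open(chunk_path, "wb") as dst:
                # Copy the audio format so each chunk is a valid WAV file
                dst.setnchannels(channels)
                dst.setsampwidth(sample_width)
                dst.setframerate(rate)
                dst.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

Each returned path could then be sent in its own `{"type": "audio", "audio": chunk_path}` message, and the partial transcripts joined in order. Cutting at fixed intervals can split a word in two, so this is only a starting point.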

AST (automatic speech translation)

prompt = {
  "role": "user",
  "content": [
    {"type": "audio", "audio": "/content/m_k.wav"},
    {"type": "text", "text": "Transcribe this audio into English, and then translate it into French."},
  ]
}
chat = ChatState(model, processor)
chat.send_message(prompt)

Output

您好。我想看一下最近有什么手机装备。 您好，最近热门的是iPhone 16和Galaxy S22。 您有品牌偏好吗？ 我原本要iPhone 12用得还不错，不过有点卡了。 那iPhone 16是不错的选择，它的硬件设计很多，拍照也很清晰，搭配我们的。
French:

Bonjour. Je voudrais voir quels sont les équipements de téléphone les plus populaires récemment. Bonjour, les plus populaires récemment sont l'iPhone 16 et le Galaxy S22. Avez-vous une préférence de marque? J'avais l'iPhone 12, et il fonctionnait bien, mais il était un peu lent. L'iPhone 16 est un bon choix, il a beaucoup de matériel conçu, et les photos sont très claires, en harmonie avec notre.
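
The two runs transcribed the same file slightly differently (for example "iPhone 17" and "Galaxy S26" in the first run, "iPhone 16" and "Galaxy S22" in the second). A rough way to compare transcripts across runs is character-level matching with the standard `difflib` module. This is only a sketch: the helper names are hypothetical, and without a reference transcript it measures agreement between runs, not accuracy.

```python
import difflib


def similarity(a, b):
    """Return a 0..1 similarity ratio between two transcripts."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def differing_spans(a, b):
    """Return (text_in_a, text_in_b) pairs for every span that is not identical."""
    matcher = difflib.SequenceMatcher(None, a, b)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# One sentence from each of the two outputs in this post
run_1 = "最近热门的是iPhone 17和Galaxy S26。"
run_2 = "最近热门的是iPhone 16和Galaxy S22。"

print(f"similarity: {similarity(run_1, run_2):.3f}")
for left, right in differing_spans(run_1, run_2):
    print(repr(left), "->", repr(right))
```

With a human-checked reference transcript, the same comparison could serve as a rough accuracy check for each run.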

Summary

I did not expect Gemma 3n to handle ASR as well, and the results are decent.
